This analysis will focus on building a regression model to predict house sale prices using Kaggle data. The process involves identifying key predictors, interpreting their impact, and validating the model.
library(tidyverse)
library(dplyr)
library(pander)
train = read.csv("train.csv", stringsAsFactors = TRUE)
The first thing I did was examine a pairs plot to look for variables with a clear slope against SalePrice.
This is what I found:
Slopes that look like they could be useful:
Smaller but potentially useful slopes:
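As a sketch of how such a pairs plot can be generated (the column subset here is illustrative, not the exact set I examined):

```r
# Scatterplot matrix for a few candidate predictors against SalePrice.
# The column subset is illustrative; any numeric columns of interest can be swapped in.
candidates <- c("SalePrice", "GrLivArea", "GarageArea", "TotRmsAbvGrd", "OverallQual")
pairs(train[, candidates], pch = 16, col = rgb(0, 0, 0, 0.2))
```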
After that, I went through the data dictionary looking for variables that seemed logically useful.
Here is what I found:
This could help with location
Both of these wrap many underlying factors into a single score
OverallQual
OverallCond
These could capture economic effects tied to timing
YearBuilt
YearRemodAdd
YrSold
MoSold
These might show how having a basement, finished or not, contributes
BsmtFinSF1
BsmtUnfSF
The total number of baths could be useful
BsmtFullBath
BsmtHalfBath
FullBath
HalfBath
I then used all of that new knowledge and started building linear models, with varying degrees of success. Below you can see a sample of the models I tried.
mylm1 = lm(SalePrice ~ GarageArea + KitchenQual + TotRmsAbvGrd + GrLivArea + OverallQual, data = train)
pander(summary(mylm1))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 9356 | 9630 | 0.9716 | 0.3314 |
| GarageArea | 63.35 | 5.901 | 10.74 | 6.338e-26 |
| KitchenQualFa | -69519 | 8249 | -8.428 | 8.372e-17 |
| KitchenQualGd | -53291 | 4413 | -12.08 | 4.637e-32 |
| KitchenQualTA | -68226 | 5046 | -13.52 | 2.526e-39 |
| TotRmsAbvGrd | -1757 | 1110 | -1.583 | 0.1136 |
| GrLivArea | 52.89 | 3.899 | 13.57 | 1.47e-39 |
| OverallQual | 21394 | 1146 | 18.68 | 6.296e-70 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1460 | 38268 | 0.7691 | 0.768 |
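To see how these coefficients combine into a prediction, the fitted model can be applied to a hypothetical house (all input values below are made up for illustration):

```r
# Predict the sale price of a hypothetical house using mylm1.
# These input values are illustrative only, not taken from the data.
example_house <- data.frame(GarageArea = 500, KitchenQual = "Gd",
                            TotRmsAbvGrd = 7, GrLivArea = 1800, OverallQual = 7)
predict(mylm1, newdata = example_house)
```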
mylm2 = lm(SalePrice ~ SaleCondition + X2ndFlrSF + X1stFlrSF + TotalBsmtSF, data = train)
pander(summary(mylm2))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -27636 | 6172 | -4.478 | 8.143e-06 |
| SaleConditionAdjLand | 6546 | 24018 | 0.2725 | 0.7852 |
| SaleConditionAlloca | 7555 | 14605 | 0.5173 | 0.605 |
| SaleConditionFamily | -9272 | 11519 | -0.8049 | 0.421 |
| SaleConditionNormal | 21874 | 4877 | 4.485 | 7.865e-06 |
| SaleConditionPartial | 69847 | 6432 | 10.86 | 1.829e-26 |
| X2ndFlrSF | 82.07 | 2.889 | 28.41 | 1.558e-141 |
| X1stFlrSF | 82.13 | 5.689 | 14.44 | 3.08e-44 |
| TotalBsmtSF | 57.42 | 5.041 | 11.39 | 7.594e-29 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1460 | 47039 | 0.6513 | 0.6494 |
mylm3 = lm(SalePrice ~ OverallQual + OverallCond + YearBuilt + YearRemodAdd + YrSold + MoSold, data = train)
pander(summary(mylm3))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 499204 | 1924199 | 0.2594 | 0.7953 |
| OverallQual | 40654 | 1175 | 34.61 | 6.353e-192 |
| OverallCond | 985.9 | 1330 | 0.7415 | 0.4585 |
| YearBuilt | 217.5 | 64.37 | 3.379 | 0.0007474 |
| YearRemodAdd | 261.5 | 85.98 | 3.041 | 0.0024 |
| YrSold | -756 | 958.9 | -0.7885 | 0.4306 |
| MoSold | -234 | 471.2 | -0.4967 | 0.6195 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1460 | 48001 | 0.6364 | 0.6349 |
mylm4 = lm(SalePrice ~ BsmtFullBath + BsmtHalfBath + FullBath + HalfBath, data = train)
pander(summary(mylm4))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 23972 | 5092 | 4.708 | 2.739e-06 |
| BsmtFullBath | 42546 | 3063 | 13.89 | 2.813e-41 |
| BsmtHalfBath | 18980 | 6652 | 2.853 | 0.004389 |
| FullBath | 79592 | 2883 | 27.61 | 2.954e-135 |
| HalfBath | 34458 | 3147 | 10.95 | 7.276e-27 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1460 | 59871 | 0.4336 | 0.432 |
mylm5 = lm(SalePrice ~ OverallQual:KitchenQual:OverallCond, data = train)
pander(summary(mylm5))
| | Estimate | Std. Error | t value |
|---|---|---|---|
| (Intercept) | 77804 | 5902 | 13.18 |
| OverallQual:KitchenQualEx:OverallCond | 5712 | 178.2 | 32.05 |
| OverallQual:KitchenQualFa:OverallCond | 1259 | 359 | 3.506 |
| OverallQual:KitchenQualGd:OverallCond | 3581 | 162.8 | 21.99 |
| OverallQual:KitchenQualTA:OverallCond | 2068 | 193 | 10.71 |
| | Pr(>|t|) |
|---|---|
| (Intercept) | 1.421e-37 |
| OverallQual:KitchenQualEx:OverallCond | 5.574e-171 |
| OverallQual:KitchenQualFa:OverallCond | 0.0004687 |
| OverallQual:KitchenQualGd:OverallCond | 9.659e-93 |
| OverallQual:KitchenQualTA:OverallCond | 7.801e-26 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1460 | 53992 | 0.5394 | 0.5381 |
After trying those, I turned to the variables we used in class, and also combined existing columns to create new ones of my own.
You can see below what I did.
train <- train %>%
  mutate(TotalSF = X1stFlrSF + X2ndFlrSF + TotalBsmtSF,
         Totalbaths = BsmtFullBath + BsmtHalfBath + FullBath + HalfBath,
         renovatedyn = ifelse(YearRemodAdd == YearBuilt, "No", "Yes"),
         age_at_sale = YrSold - YearBuilt,
         RichNbrhd = case_when(Neighborhood %in% c("StoneBr", "NridgHt", "NoRidge") ~ 1,
                               TRUE ~ 0),
         OverallQual_group = ifelse(OverallQual <= 5, 5, OverallQual))
mylm6 = lm(SalePrice ~ TotalSF, data = train)
pander(summary(mylm6))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -13220 | 4251 | -3.11 | 0.001909 |
| TotalSF | 75.63 | 1.577 | 47.95 | 5.511e-302 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1460 | 49506 | 0.6119 | 0.6117 |
mylm7 = lm(SalePrice ~ TotalSF + Totalbaths + MSZoning + renovatedyn + age_at_sale, data = train)
pander(summary(mylm7))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -5369 | 14760 | -0.3638 | 0.7161 |
| TotalSF | 58.5 | 1.727 | 33.87 | 8.489e-186 |
| Totalbaths | 10425 | 1650 | 6.319 | 3.505e-10 |
| MSZoningFV | 34099 | 15104 | 2.258 | 0.02412 |
| MSZoningRH | 15329 | 17441 | 0.8789 | 0.3796 |
| MSZoningRL | 29896 | 13917 | 2.148 | 0.03186 |
| MSZoningRM | 24092 | 13999 | 1.721 | 0.08546 |
| renovatedynYes | 18960 | 2534 | 7.483 | 1.25e-13 |
| age_at_sale | -742.2 | 50.87 | -14.59 | 4.316e-45 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1460 | 43188 | 0.7061 | 0.7045 |
mylm8 = lm(SalePrice ~ TotalSF + OverallQual + Totalbaths + MSZoning + renovatedyn + YearBuilt + age_at_sale, data = train)
pander(summary(mylm8))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 292076 | 1525175 | 0.1915 | 0.8482 |
| TotalSF | 38.58 | 1.835 | 21.03 | 6.996e-86 |
| OverallQual | 22794 | 1150 | 19.82 | 1.596e-77 |
| Totalbaths | 9224 | 1468 | 6.284 | 4.347e-10 |
| MSZoningFV | 6077 | 13482 | 0.4507 | 0.6522 |
| MSZoningRH | 1587 | 15507 | 0.1023 | 0.9185 |
| MSZoningRL | 14203 | 12380 | 1.147 | 0.2515 |
| MSZoningRM | 590 | 12488 | 0.04725 | 0.9623 |
| renovatedynYes | 13229 | 2267 | 5.834 | 6.654e-09 |
| YearBuilt | -188.5 | 759.6 | -0.2481 | 0.8041 |
| age_at_sale | -491.5 | 757.5 | -0.6488 | 0.5165 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1460 | 38334 | 0.7687 | 0.7672 |
I then looked over all of my models and chose the following one.
Formula:
\[ \begin{aligned} SalePrice = & \ \beta_0 + \beta_1 (\text{RichNbrhd}) + \beta_2 (\text{TotalSF}) + \beta_3 (\text{OverallQual\_group}) \\ & + \beta_4 (\text{RichNbrhd} \times \text{TotalSF}) + \beta_5 (\text{RichNbrhd} \times \text{OverallQual\_group}) \\ & + \beta_6 (\text{TotalSF} \times \text{OverallQual\_group}) + \beta_7 (\text{RichNbrhd} \times \text{TotalSF} \times \text{OverallQual\_group}) \\ & + \beta_8 (\text{YearBuilt}) + \epsilon \end{aligned} \]
Model Summary:
mylm_80 = lm(SalePrice ~ RichNbrhd*TotalSF*as.factor(OverallQual_group) + YearBuilt, data = train)
pander(summary(mylm_80))
| | Estimate | Std. Error |
|---|---|---|
| (Intercept) | -694411 | 61187 |
| RichNbrhd | -568990 | 47159 |
| TotalSF | 33.76 | 2.194 |
| as.factor(OverallQual_group)6 | -13805 | 8351 |
| as.factor(OverallQual_group)7 | -22768 | 10043 |
| as.factor(OverallQual_group)8 | -26973 | 17308 |
| as.factor(OverallQual_group)9 | -48468 | 112525 |
| as.factor(OverallQual_group)10 | 418510 | 28134 |
| YearBuilt | 384.2 | 31.3 |
| RichNbrhd:TotalSF | 148.3 | 8.66 |
| RichNbrhd:as.factor(OverallQual_group)6 | 733242 | 974427 |
| RichNbrhd:as.factor(OverallQual_group)7 | 487985 | 60834 |
| RichNbrhd:as.factor(OverallQual_group)8 | 570124 | 53513 |
| RichNbrhd:as.factor(OverallQual_group)9 | 397519 | 128680 |
| TotalSF:as.factor(OverallQual_group)6 | 12.83 | 3.509 |
| TotalSF:as.factor(OverallQual_group)7 | 23.17 | 3.782 |
| TotalSF:as.factor(OverallQual_group)8 | 32.17 | 5.52 |
| TotalSF:as.factor(OverallQual_group)9 | 50.69 | 31.94 |
| TotalSF:as.factor(OverallQual_group)10 | -64.14 | 4.555 |
| RichNbrhd:TotalSF:as.factor(OverallQual_group)6 | -206.5 | 364.8 |
| RichNbrhd:TotalSF:as.factor(OverallQual_group)7 | -118.2 | 15.29 |
| RichNbrhd:TotalSF:as.factor(OverallQual_group)8 | -143.5 | 11.32 |
| RichNbrhd:TotalSF:as.factor(OverallQual_group)9 | -98.19 | 34.61 |
| | t value | Pr(>|t|) |
|---|---|---|
| (Intercept) | -11.35 | 1.195e-28 |
| RichNbrhd | -12.07 | 5.353e-32 |
| TotalSF | 15.39 | 1.436e-49 |
| as.factor(OverallQual_group)6 | -1.653 | 0.09853 |
| as.factor(OverallQual_group)7 | -2.267 | 0.02354 |
| as.factor(OverallQual_group)8 | -1.558 | 0.1194 |
| as.factor(OverallQual_group)9 | -0.4307 | 0.6667 |
| as.factor(OverallQual_group)10 | 14.88 | 1.162e-46 |
| YearBuilt | 12.27 | 5.277e-33 |
| RichNbrhd:TotalSF | 17.13 | 5.419e-60 |
| RichNbrhd:as.factor(OverallQual_group)6 | 0.7525 | 0.4519 |
| RichNbrhd:as.factor(OverallQual_group)7 | 8.022 | 2.147e-15 |
| RichNbrhd:as.factor(OverallQual_group)8 | 10.65 | 1.46e-25 |
| RichNbrhd:as.factor(OverallQual_group)9 | 3.089 | 0.002045 |
| TotalSF:as.factor(OverallQual_group)6 | 3.657 | 0.0002647 |
| TotalSF:as.factor(OverallQual_group)7 | 6.128 | 1.151e-09 |
| TotalSF:as.factor(OverallQual_group)8 | 5.828 | 6.917e-09 |
| TotalSF:as.factor(OverallQual_group)9 | 1.587 | 0.1127 |
| TotalSF:as.factor(OverallQual_group)10 | -14.08 | 2.864e-42 |
| RichNbrhd:TotalSF:as.factor(OverallQual_group)6 | -0.5661 | 0.5714 |
| RichNbrhd:TotalSF:as.factor(OverallQual_group)7 | -7.726 | 2.077e-14 |
| RichNbrhd:TotalSF:as.factor(OverallQual_group)8 | -12.67 | 5.661e-35 |
| RichNbrhd:TotalSF:as.factor(OverallQual_group)9 | -2.837 | 0.004615 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1460 | 28884 | 0.8698 | 0.8678 |
The model above (mylm_80) is the one I chose for the purposes of this analysis.
# ggplot(train, aes(y = SalePrice, x = TotalSF, color = as.factor(RichNbrhd))) +
# geom_point() +
# facet_wrap(~OverallQual_group) +
# geom_line(aes(y = mylm_80$fitted))
# ggplot(train, aes(y = SalePrice, x = TotalSF,
# color = as.factor(RichNbrhd),
# size = YearBuilt)) +
# geom_point() +
# facet_wrap(~OverallQual_group) +
# geom_line(aes(y = mylm_80$fitted), size = 1) # Ensures line has a fixed size
train <- train %>%
mutate(YearBuilt_group = ifelse(YearBuilt < 1900, "Pre-1900",
ifelse(YearBuilt < 1920, "1900s",
ifelse(YearBuilt < 1940, "1920s-30s",
ifelse(YearBuilt < 1960, "1940s-50s",
ifelse(YearBuilt < 1980, "1960s-70s",
ifelse(YearBuilt < 2000, "1980s-90s",
"2000s-2010s")))))))
ggplot(train, aes(y = SalePrice, x = TotalSF,
color = as.factor(RichNbrhd))) +
geom_point() +
facet_wrap(~OverallQual_group + YearBuilt_group) + # Facet by both OverallQual_group and decade
geom_line(aes(y = mylm_80$fitted), size = 1) # Ensures line has a fixed size
Below is the validation process for my model. Because the \(R^2\) did not change drastically between the original fit and the validation set, we can be more confident that the model generalizes well.
set.seed(121)
num_rows <- 1000 #1460 total
keep <- sample(1:nrow(train), num_rows)
mytrain <- train[keep, ] #Use this in the lm(..., data=mytrain) it is like "rbdata"
mytest <- train[-keep, ] #Use this in the predict(..., newdata=mytest) it is like "rbdata2"
mylm_80 = lm(SalePrice ~ RichNbrhd*TotalSF*as.factor(OverallQual_group) + YearBuilt, data = mytrain)
# pander(summary(mylm_80))
yh_mylm_80 <- predict(mylm_80, newdata=mytest)
ybar <- mean(mytest$SalePrice)
SSTO <- sum( (mytest$SalePrice - ybar)^2 )
SSE_mylm_80 <- sum((mytest$SalePrice - yh_mylm_80)^2)
rs_mylm_80 <- 1 - SSE_mylm_80/SSTO
n <- nrow(mytest)
p_mylm_80 <- length(coef(mylm_80)) # Number of estimated coefficients
rsa_mylm_80 <- 1 - ((n - 1) / (n - p_mylm_80)) * SSE_mylm_80 / SSTO
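In formula form, the validation quantities computed above are:

\[
R^2_{\text{val}} = 1 - \frac{SSE}{SSTO}, \qquad
R^2_{\text{val, adj}} = 1 - \frac{n-1}{n-p}\cdot\frac{SSE}{SSTO}
\]

where \(SSE = \sum (y_i - \hat{y}_i)^2\) over the test set, \(SSTO = \sum (y_i - \bar{y})^2\), \(n\) is the number of test rows, and \(p\) is the number of estimated coefficients.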
# rsa_mylm_80
# summary(mylm_80)
my_output_table2 <- data.frame(Model = c("My Model"),
`Original R2` = c(summary(mylm_80)$r.squared),
`Orig. Adj. R-squared` = c(summary(mylm_80)$adj.r.squared),
`Validation R-squared` = c(rs_mylm_80),
`Validation Adj. R^2` = c(rsa_mylm_80))
colnames(my_output_table2) <- c("Model", "Original $R^2$", "Original Adj. $R^2$", "Validation $R^2$", "Validation Adj. $R^2$")
knitr::kable(my_output_table2, escape=TRUE, digits=4)
| Model | Original \(R^2\) | Original Adj. \(R^2\) | Validation \(R^2\) | Validation Adj. \(R^2\) |
|---|---|---|---|---|
| My Model | 0.8698 | 0.8678 | 0.8727 | 0.8657 |
Below is the visualization of the model, where each line represents the different types of houses that the model takes into account.
For example, if a house has an Overall Quality score of 9, a rich neighborhood rating of 1, and a year built of, say, 2000, then you would look at the blue line for the chart labeled 9, 2000s-2010s. From there, you would find the total square footage of your house on the x-axis and then locate the estimated sales price on the y-axis. By doing all of this, you would have a pretty good idea of what your house might sell for.
There are caveats to this, of course, the biggest one being that the residual standard error is 28,884. This means that a house's actual sale price will probably land near the value indicated by the line, plus or minus roughly $28,884.
Because TotalSF is on the x-axis, the only terms that affect the slope are TotalSF and the terms that interact with it. Everything else only shifts the intercept.
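Rather than reading values off the chart, the same estimate (with an uncertainty range) can be obtained directly from the model; the TotalSF value below is an assumed illustration:

```r
# Numeric prediction for the example house described above.
# TotalSF = 2500 is an assumed illustrative value; the other inputs match the text.
newhouse <- data.frame(RichNbrhd = 1, TotalSF = 2500,
                       OverallQual_group = 9, YearBuilt = 2000)
predict(mylm_80, newdata = newhouse, interval = "prediction")
```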
# ggplot(train, aes(y = SalePrice, x = TotalSF,
# color = as.factor(RichNbrhd))) +
# geom_point() +
# facet_wrap(~OverallQual_group + YearBuilt_group) +
# geom_line(aes(y = mylm_80$fitted), size = 1)
train <- train %>%
mutate(YearBuilt_group = case_when(
YearBuilt < 1900 ~ "Pre-1900",
YearBuilt < 1920 ~ "1900s",
YearBuilt < 1940 ~ "1920s-30s",
YearBuilt < 1960 ~ "1940s-50s",
YearBuilt < 1980 ~ "1960s-70s",
YearBuilt < 2000 ~ "1980s-90s",
TRUE ~ "2000s-2010s"
),
OverallQual_group = case_when(
OverallQual <= 5 ~ 5,
TRUE ~ OverallQual
),
Combined_Group = case_when(
(OverallQual == 8 & YearBuilt_group == "1900s") |
(OverallQual == 9 & YearBuilt_group %in% c("Pre-1900", "1900s", "1920s-30s", "1940s-50s", "1960s-70s", "1980s-90s")) |
(OverallQual == 8 & YearBuilt_group == "Pre-1900") |
(OverallQual == 7 & YearBuilt_group == "1900s") |
(OverallQual == 6 & YearBuilt_group == "Pre-1900") |
(OverallQual == 10 & YearBuilt_group == "Pre-1900") ~ "Combined",
TRUE ~ paste(OverallQual_group, YearBuilt_group, sep = "_")
))
ggplot(train, aes(y = SalePrice, x = TotalSF, color = as.factor(RichNbrhd))) +
geom_point() +
facet_wrap(~ Combined_Group) +
geom_line(aes(y = mylm_80$fitted), size = 1)
*I had to combine some of the categories because of a lack of data.